Distance phenomena in high-dimensional chemical descriptor spaces: Consequences for similarity-based approaches

Authors

  • Matthias Rupp
  • Petra Schneider
  • Gisbert Schneider
Abstract

Measuring the (dis)similarity of molecules is important for many cheminformatics applications such as compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence that the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and similarity coefficients, as well as the relationships between them, using (dis)similarity measures commonly employed in cheminformatics as examples. We present several phenomena (the empty space phenomenon, sphere-volume-related phenomena, and distance concentration) in high-dimensional descriptor spaces that are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data.
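The sphere volume and distance concentration phenomena named in the abstract can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and the data are assumed to be points drawn uniformly at random from the unit cube rather than real molecular descriptors. It computes the fraction of the cube [-1, 1]^d occupied by the inscribed unit ball, and the relative contrast (max - min) / min of Euclidean distances from a random query point to a random sample.

```python
import math
import random


def ball_to_cube_ratio(d):
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit ball.

    Computed in log space (lgamma) so that large d does not overflow.
    """
    log_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_cube = d * math.log(2)
    return math.exp(log_ball - log_cube)


def relative_contrast(d, n_points=500, seed=0):
    """(max - min) / min of Euclidean distances from a query to random points.

    Assumption for illustration: query and points are uniform in [0, 1]^d.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(d)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(d)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for d in (2, 3, 10, 100, 1000):
        print(d, ball_to_cube_ratio(d), relative_contrast(d))
```

Under these assumptions, the volume fraction shrinks rapidly toward zero as d grows, and the relative contrast between the nearest and farthest point decreases, which is why rankings based on raw distances tend to become less discriminative in high-dimensional descriptor spaces.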

Similar resources

Hyperspectral Image Classification Based on the Fusion of the Features Generated by Sparse Representation Methods, Linear and Non-linear Transformations

The ability to record the high-resolution spectral signature of the Earth's surface is the most important feature of hyperspectral sensors. Classification of hyperspectral imagery, in turn, is one of the methods for extracting information from these remote sensing data sources. Despite the high potential of hyperspectral images in terms of information content, there...

Consequences Modeling of the Akçagaz Accident through Land Use Planning (LUP) Approach

In this study, a consequence analysis of the Akçagaz LPG Facilities accident was conducted. The consequence analysis and modeling studies were performed using EFFECTS 10.0 software for two liquefied gas LOC (Loss of Containment) scenarios. One of the scenarios was G1: instantaneous release corresponding to BLEVE (Boiling Liquid Expanding Vapor Explosion), and the other was G2: Release in 1...

Aspects of Metric Spaces in Computation

Metric spaces, which generalise the properties of commonly-encountered physical and abstract spaces into a mathematical framework, frequently occur in computer science applications. Three major kinds of questions about metric spaces are considered here: the intrinsic dimensionality of a distribution, the maximum number of distance permutations, and the difficulty of reverse similarity search. I...

PAC Nearest Neighbor Queries: Using the Distance Distribution for Searching in High-Dimensional Metric Spaces

In this paper we introduce a new paradigm for similarity search, called PAC-NN (probably approximately correct nearest neighbor) queries, aiming to break the “dimensionality curse” which inhibits current approaches to be applied in high-dimensional spaces. PAC-NN queries return, with probability at least 1 − δ, a (1 + ε)-approximate NN – an object whose distance from the query q is less than (1 + ...

CoFD : An Algorithm for Non-distance Based Clustering in High Dimensional Spaces

The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high dimensional spaces. In this paper, we propose CoFD algorithm, which is a non-dis...

Journal title:
  • Journal of Computational Chemistry

Volume 30, Issue 14

Pages: -

Publication date: 2009